How Diachronic Text Corpora Affect Context based Retrieval of OOV Proper Names for Audio News

نویسندگان

  • Imran A. Sheikh
  • Irina Illina
  • Dominique Fohr
چکیده

Out-Of-Vocabulary (OOV) words missed by Large Vocabulary Continuous Speech Recognition (LVCSR) systems can be recovered with the help of topic and semantic context of the OOV words captured from a diachronic text corpus. In this paper we investigate how the choice of documents for the diachronic text corpora affects the retrieval of OOV Proper Names (PNs) relevant to an audio document. We first present our diachronic French broadcast news datasets, which highlight the motivation of our study on OOV PNs. Then the effect of using diachronic text data from different sources and a different time span is analysed. With OOV PN retrieval experiments on French broadcast news videos, we conclude that a diachronic corpus with text from different sources leads to better retrieval performance than one relying on text from single source or from a longer time span.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Learning to retrieve out-of-vocabulary words in speech recognition

Many Proper Names (PNs) are Out-Of-Vocabulary (OOV) words for speech recognition systems used to process diachronic audio data. To help recovery of the PNs missed by the system, relevant OOV PNs can be retrieved out of the many OOVs by exploiting semantic context of the spoken content. In this paper, we propose two neural network models targeted to retrieve OOV PNs relevant to an audio document...

متن کامل

Neural Networks Revisited for Proper Name Retrieval from Diachronic Documents

Developing high-quality transcription systems for very large vocabulary corpora is a challenging task. Proper names are usually key to understanding the information contained in a document. To increase the vocabulary coverage, a huge amount of text data should be used. In this paper, we extend the previously proposed neural networks for word embedding models: word vector representation proposed...

متن کامل

Continuous word representation using neural networks for proper name retrieval from diachronic documents

Developing high-quality transcription systems for very large vocabulary corpora is a challenging task. Proper names are usually key to understanding the information contained in a document. One approach for increasing the vocabulary coverage of a speech transcription system is to automatically retrieve new proper names from contemporary diachronic text documents. In recent years, neural network...

متن کامل

Improved Neural Bag-of-Words Model to Retrieve Out-of-Vocabulary Words in Speech Recognition

Many Proper Names (PNs) are Out-Of-Vocabulary (OOV) words for speech recognition systems used to process diachronic audio data. To enable recovery of the PNs missed by the system, relevant OOV PNs can be retrieved by exploiting the semantic context of the spoken content. In this paper, we explore the Neural Bag-of-Words (NBOW) model, proposed previously for text classification, to retrieve rele...

متن کامل

Proper name retrieval from diachronic documents for automatic speech transcription using lexical and temporal context

Proper names are usually key to understanding the information contained in a document. Our work focuses on increasing the vocabulary coverage of a speech transcription system by automatically retrieving new proper names from contemporary diachronic text documents. The idea is to use in-vocabulary proper names as an anchor to collect new linked proper names from the diachronic corpus. Our assump...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016